Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and Music
Authors
Abstract
In the context of multimedia content, a modality can be defined as a type of data item such as text, images, music, and videos. Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa. Moreover, much of the existing research relies on metadata such as keywords, tags, or associated descriptions that must be individually produced and attached afterward. This paper introduces a new content-based, cross-modal retrieval method for video and music that is implemented through deep neural networks. The proposed model consists of a two-branch network that extracts features from the two different modalities and embeds them into a single embedding space. We train the network via a cross-modal ranking loss such that videos and music with similar semantics end up close together in the embedding space. In addition, to preserve inherent characteristics within each modality, the proposed single-modal structure loss was also used for training. Owing to the lack of a dataset to evaluate cross-modal video-music tasks, we constructed a large-scale video-music pair benchmark. Finally, we introduced reasonable quantitative and qualitative experimental protocols. The experimental results on our dataset are expected to be a baseline for subsequent studies of less-mature video-to-music and music-to-video related tasks.
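The two-branch embedding and ranking objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the loss formula, so the sketch assumes a common bidirectional max-margin (triplet-style) ranking loss, replaces each deep branch with a single linear projection, and omits the single-modal structure loss because the abstract does not define it. The names `embed`, `ranking_loss`, `margin`, and the feature dimensions are hypothetical.

```python
import numpy as np


def embed(features, weights):
    """One branch: linear projection, then L2 normalisation (stand-in for a deep network)."""
    z = features @ weights
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def ranking_loss(video_emb, music_emb, margin=0.2):
    """Assumed bidirectional max-margin ranking loss over a batch of matched pairs.

    Row i of video_emb is taken to match row i of music_emb; all other rows
    in the batch serve as negatives.
    """
    sim = video_emb @ music_emb.T              # pairwise similarities, shape (n, n)
    pos = np.diag(sim)                         # similarity of each true pair
    # video -> music: a wrong music clip should score below the true one by the margin
    v2m = np.maximum(0.0, margin + sim - pos[:, None])
    # music -> video: the same constraint with the roles swapped
    m2v = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = ~np.eye(sim.shape[0], dtype=bool)   # exclude the true pairs themselves
    return (v2m[off_diag].sum() + m2v[off_diag].sum()) / sim.shape[0]


# Toy usage with random features (hypothetical sizes: 16-d video, 12-d music, 8-d shared space)
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(4, 16))
music_feats = rng.normal(size=(4, 12))
W_video = rng.normal(size=(16, 8))
W_music = rng.normal(size=(12, 8))
loss = ranking_loss(embed(video_feats, W_video), embed(music_feats, W_music))
```

Minimising such a loss with respect to the branch parameters pulls matched video and music embeddings together and pushes mismatched ones at least `margin` apart, which is the behaviour the abstract attributes to the cross-modal ranking loss.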
Similar resources
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has successfully demonstrated excellent performances in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics are taken into account. Stemming from the ch...
A Music Video Information Retrieval Approach to Artist Identification
We propose a cross-modal approach based on separate audio and image data-sets to identify the artist of a given music video. The identification process is based on an ensemble of two separate classifiers. Audio content classification is based on audio features derived from the Million Song Dataset (MSD). Face recognition is based on Local Binary Patterns (LBP) using a training-set of artist por...
Learning Deep Semantic Embeddings for Cross-Modal Retrieval
Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to result in large intra-class variances, which is not well suited for cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is ...
Cross-Modal Learning - The Learning Methodology Inspired by Human’s Intelligence
Humans have a remarkable cross-modal learning capability. To endow computers with the same ability, we use a model based on quotient space theory. In the quotient space model, representations at different modalities form a complete semi-order lattice, and translation from one modality to another becomes easier. Therefore, it is suitable to be a mathematical model of cross-mod...
Towards End-to-End Audio-Sheet-Music Retrieval
This paper demonstrates the feasibility of learning to retrieve short snippets of sheet music (images) when given a short query excerpt of music (audio) – and vice versa –, without any symbolic representation of music or scores. This would be highly useful in many content-based musical retrieval scenarios. Our approach is based on Deep Canonical Correlation Analysis (DCCA) and learns correlated...
Journal: CoRR
Volume: abs/1704.06761
Publication date: 2017